Skip to content

fix(ci): ride out transient registry blips in nightly image pulls#191

Merged
jsugg merged 1 commit into
mainfrom
nightly-registry-pull-retry-hardening
Jun 17, 2026
Merged

fix(ci): ride out transient registry blips in nightly image pulls#191
jsugg merged 1 commit into
mainfrom
nightly-registry-pull-retry-hardening

Conversation

@jsugg

@jsugg jsugg commented Jun 17, 2026

Copy link
Copy Markdown
Owner

Summary

The nightly Linux Fresh Host job has been failing intermittently. RCA of the latest failed run (27669020598) showed the ensure-images prepull step exiting 1 with every image dying at the same ~15s wall:

Error response from daemon: Get "https://registry-1.docker.io/v2/": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
Error response from daemon: Get "https://registry-1.docker.io/v2/": net/http: request canceled while waiting for connection

That ~15s is the Docker daemon's own registry HTTP client timeout reaching Docker Hub — i.e. a transient registry connectivity blip, not a daemon fault. The pull helper already retries, but these registry timeouts were not recognized as a distinct transient class, so they fell through to the short generic backoff (2s, 4s) and exhausted the retry budget within seconds — explaining the pass/fail/pass flakiness.

Changes

  • Classify transient registry failures (context deadline exceeded, request canceled while waiting for connection, i/o timeout, tls handshake timeout, DNS errors, …) as a separate retry class from daemon-connectivity faults, in the hosted-docker pull helper.
  • Apply a longer, capped exponential backoff for that class (10s → 20s → 40s → 45s) so a brief upstream outage is ridden out across the retry budget instead of being burned in a few seconds. Generic failures keep the existing short backoff.
  • Raise the Linux fresh-host pull retry budget to 5 so the new backoff has room to work.
  • Added unit coverage asserting the exponential-capped schedule and that the helper retries to success once the registry recovers.

Local: 38 passed, 4 skipped (hosted-docker unit + CI workflow contract suites); ruff lint/format and mypy clean on the changed files.

Note (not addressed here)

The Linux fresh-host job reads the pull knobs from the top-level workflow env, so it ignores the per-caller docker_pull_* inputs (only the macOS job wires them through). Bumping the budget therefore lives in the shared env. Wiring the Linux job to honor per-caller inputs is a reasonable follow-up but is out of scope for this fix.

The Linux nightly fresh-host job failed intermittently when 'docker pull' could not reach registry-1.docker.io ('context deadline exceeded' / 'request canceled while waiting for connection'). These registry network timeouts were not classified as transient, so the pull helper retried them with the short generic backoff (2s, 4s) and burned its retry budget within seconds.

Classify registry connectivity timeouts as a distinct transient failure and apply a longer, capped exponential backoff (10s, 20s, 40s, 45s) so a brief upstream outage is ridden out across the retry budget. Raise the Linux fresh-host pull retry budget to 5 so the backoff has room to work. Add unit coverage for the new backoff schedule.
@jsugg jsugg merged commit 146ca99 into main Jun 17, 2026
33 of 35 checks passed
@jsugg jsugg deleted the nightly-registry-pull-retry-hardening branch June 17, 2026 23:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant